Clinical diagnosis of metabolic disorders using untargeted metabolomic profiling and disease-specific networks learned from profiling data

Untargeted metabolomics is a global molecular profiling technology that can be used to screen for inborn errors of metabolism (IEMs). Metabolite perturbations are evaluated based on current knowledge of specific metabolic pathway deficiencies, a manual diagnostic process that is qualitative, has limited scalability, and is not equipped to learn from accumulating clinical data. Our purpose was to improve upon manual diagnosis of IEMs in the clinic by developing novel computational methods for analyzing untargeted metabolomics data. We employed CTD, an automated computational diagnostic method that “connects the dots” between metabolite perturbations observed in individual metabolomics profiling data and modules identified in diseasespecific metabolite co-perturbation networks learned from prior profiling data. We also extended CTD to calculate distances between any two individuals (CTDncd) and between an individual and a disease state (CTDdm), to provide additional network-quantified predictors for use in diagnosis. We show that across 539 plasma samples, CTD-based network-quantified measures can reproduce accurate diagnosis of 16 different IEMs, including adenylosuccinase deficiency, argininemia, argininosuccinic aciduria, aromatic l-amino acid decarboxylase deficiency, cerebral creatine deficiency syndrome type 2, citrullinemia, cobalamin biosynthesis defect, GABA-transaminase deficiency, glutaric acidemia type 1, maple syrup urine disease, methylmalonic aciduria, ornithine transcarbamylase deficiency, phenylketonuria, propionic acidemia, rhizomelic chondrodysplasia punctata, and the Zellweger spectrum disorders. Our approach can be used to supplement information from biochemical pathways and has the potential to significantly enhance the interpretation of variants of uncertain significance uncovered by exome sequencing. CTD, CTDdm, and CTDncd can serve as an essential toolset for biological interpretation of untargeted metabolomics data that overcomes limitations associated with manual diagnosis to assist diagnosticians in clinical decision-making. By automating and quantifying the interpretation of perturbation patterns, CTD can improve the speed and confidence by which clinical laboratory directors make diagnostic and treatment decisions, while automatically improving performance with new case data.


Methods
Data collection. Data used represent a meta-analysis of untargeted metabolomics plasma samples collected from previously reported studies [5][6][7][8][9][10]22,23 , as well as previously unreported samples (Table 1, Fig. 1). All study procedures were approved by the Institutional Review Board (IRB) of the Baylor College of Medicine and complied with all relevant guidelines and regulations. For some of the studies, informed consent was obtained and for others, it was waived by the Baylor College of Medicine's IRB-approved waiver of informed consent. All sample data were de-identified. While all sample data were processed similarly 24 , some data differ in sample source (e.g., heparin vs. EDTA plasma) and platform specifications (e.g., mass analyzer) ( Table 1). Sample data were generated by Baylor Genetics in collaboration with Metabolon, Inc. (Morrisville, NC) on referred clinical or research samples. Twenty-one research samples were collected at RhizoKids International family conference for people affected by rhizomelic chondrodysplasia punctata 25 (RhizoKids International, rhizokids.com). A total of 539 profiled plasma samples were included in this study, including 414 samples from existing studies and 125 previously unpublished samples (see Table 1). Across all 539 samples, there was a range of 376-684 named, z-scored compounds and a range of 0-261 unnamed ("unknown"), z-scored compounds included for each sample. Compounds rarely present in normal reference blood samples cannot be z-scored. However, argininosuccinate only Table 1. Description of data and data sources. For each disease cohort, the number of samples and the publication source of the data are described. Individuals in the Unknown cohort 22 , a cohort whose clinical diagnoses from which we were blinded, may also have diagnoses associated with any of the disease cohorts listed. When samples in the Unknown cohort were identical to samples associated with a known clinical diagnosis, Alaimo et al. 22 is also referenced. For samples from Miller et al. 5 , genetic sequencing data was not available and thus, only biochemical diagnoses were made. As a result, when more than one gene is responsible for a given diagnosis, all known genes associated with the given diagnosis are listed in the Related Genes column. GC-MS gas chromatography-mass spectrometry, LC-MS liquid chromatography-mass spectrometry, MS n multi-stage mass spectrometry.  www.nature.com/scientificreports/ presents in diseased samples and may serves as a strong predictor for disease and was thus encoded as a binary predictor ("1" for the presence or "0" for the absence). Unknown metabolites were excluded from the analysis in this work because the clinical utility of these compounds for diagnosis is limited.
Inference of disease-specific networks. Metabolites with z-scores in > 50% reference samples and > 50% disease samples were used for network learning. Missing z-scores were imputed using the minimum z-score of the analyte in a large reference population. In order to model the differences in perturbation signatures between disease cases and controls, two types of Gaussian graphical network models were then learned from the data: one from both disease and control samples (disease + control network), and a second from only control samples (control network) 21 . We used the Graphical Lasso algorithm implemented in the R package huge (v1.3.5) to estimate the precision matrix, where regularization parameter lambda is selected using criteria "stars". For both graphs, edge weights are the estimated partial correlation between any two metabolites after conditioning on all other variables in the datasets. Next, edges found in the disease + control network that were also found in the control network were pruned 21 . This pruned, "disease-specific" network represents probability of any pairs of metabolites being co-perturbed together at the state of the disease and was used in downstream analysis (Fig. 1).
Including both examples of disease and control profiles in the training data ("discriminative latent structure inference") introduces a hidden variable representing the disease state associated with each sample, allowing the network to model the specific metabolomic differences between two conditions (disease vs. control). Disease-specific network models from five IEMs in Thistlethwaite et al. 21 were included, as well as novel network models learned on 11 additional IEMs, totaling 16 IEM disease states (Table S2). All network structures used in this paper are accessible through the CTDext R package accessible via GitHub (https:// github. com/ BRL-BCM/ CTDext), an extension of the CTD CRAN R package that also includes added functionalities and filesharing  www.nature.com/scientificreports/ necessary to reproduce our results. Graphical model metrics including node count, edge count and graph density for all network structures are documented in Table S2.
The CTD method. The CTD method was described previously 21 . Briefly, the method takes a weighted, disease-specific "co-perturbation" network and a set of network nodes as inputs and identifies a subset of the input nodes that is highly connected within the network ( Fig. 2A,B). The method also provides a p value corresponding to the level of connectedness of the input node subset within the network.
The CTDncd distance: a network-based patient-patient distance method. In the context of a network, two individuals' sets of metabolite perturbations can be compared by calculating the distance between them. If two metabolite sets hit the same nodes or modules in the context of a network, they may be considered related; however, if two metabolite sets share no overlapping nodes, nor do they hit proximal parts of a network, they may be considered unrelated (Fig. 2C). To calculate the distance between two node subsets in a network, we use the Normalized Compression Distance (NCD) metric 26 which is based on normalized mutual information (Eq. 1).
The CTDdm distance: a network-based patient-disease distance method, a variation on CTDncd. CTDdm also uses Eq. (1) to calculate the distance between a set of metabolites perturbed in an individual and the metabolite set perturbed in individuals with a specific disease (Fig. 1). This set, referred to as the "main disease module", is calculated by Algorithm 1 in Supplemental Text 1. Information provided by CTDdm was shown to reduce false positive rate and improve overall diagnostic accuracy when combined with the CTD method (Table S1). CTD + CTDdm metrics were therefore used for scoring sample profiles.
Analysis of exome sequencing data. The collection and processing of clinical exome sequencing data from 170 individuals is detailed in Alaimo et al. 22 . Briefly, data were acquired using protocols adapted for clinical testing, described previously 27 . Variants were called using AtlasSNP2 28 (v. 1.4.3). Variants in intronic or intergenic regions were filtered out, as well as variants found in ESP5400 or 1000 Genomes 29 at frequencies greater than 0.05. The pathogenicity of each genetic variant was assessed according to the ACMG/AMP guidelines 3,30 . (1)

Figure 2.
CTDncd and CTDdm extend CTD to quantify distances between two sets of nodes in a network.
(A) CTD outputs highly connected subsets of a node set (given as an input) in a graph (also given as an input).
(B) CTD assigns higher significance to highly connected node sets compared to sparsely connected node sets.
(C) Node sets found in identical or neighboring regions in a graph are assigned shorter distances compared to node sets found in distal regions of a graph. CTDncd calculates distances between two individuals, where node sets being compared are based on observed metabolite perturbations in two individuals' metabolomics profiles. CTDdm calculates distances between an individual and a disease state, where node sets being compared are the main disease module in a graph (see Algorithm 1 in Supplemental Text 1) and the observed metabolite perturbations in a single individual's metabolomic profile. www.nature.com/scientificreports/ For each disease gene associated with any of the 15 IEMs (X-linked OTC deficiency excluded) modeled in this paper, we assigned each individual to one of 5 classes (Table 2) based on the assumption of an autosomal recessive inheritance pattern, the ACMG/AMP pathogenicity category, and observed zygosity of the variants in individuals' exomes.
The metabolomics data portal: a diagnostics tool for untargeted metabolomics data. To provide access to the data and to provide a prototype tool to aid in the clinical diagnostic process, we developed an R shiny application. This application visualizes individuals' metabolomics data and implements the networkassisted diagnostic functionalities featured in this paper. A walkthrough of the features this application offers can be found in Supplemental Text 1. The full application can be accessed at https:// genbo ree. org/ genbo reeKB/ proje cts/ metab olomi cs-data-portal.

Results
Data-driven network models show higher accuracy relative to rule-based biomarker models and eliminate the requirement for a priori biomarkers. Prior knowledge, including lists of metabolite biomarkers, have previously been integrated into rule-based models for diagnosis in metabolomics 20 . Network-based modeling is less biased toward specific lists of biomarkers and models the co-variation between metabolites and is thus more suitable in the discovery and complex diagnostics contexts. We explored the possibility that network-based approaches may also show added accuracy associated with incorporating information from full untargeted metabolomics profiles compared to information based solely from known biomarkers of metabolic disease states. Haijes et al. 20 reported models that rank diseases for each individual based on lists of known biomarkers for each IEM. The correct diagnosis ranked first in 37% of 115 validation plasma samples. In 72% of cases, the correct diagnosis could be found within a short "differential diagnosis" (DD) list of candidate diagnoses, returned by the rule-based algorithm. To compare this rule-based method to CTD, we assigned a DD to an individual based on whether combined network-quantified scores (CTD + CTDdm) match at a Bonferroni-corrected combined network p value < 0.05 in the corresponding disease-specific network model.
As shown in Table 3, CTD + CTDdm ranked the correct diagnosis (from 16 modeled IEMs) first in 70% of 154 samples. Moreover, 89% of samples had the correct diagnosis in their DD short list. When we omitted individuals with OTC deficiency (MIM:311250)-where 13 out of 17 were female-from consideration due to diagnostic difficulties associated with the possibility of skewed X-inactivation in females with this diagnosis 31,32 , CTD + CTDdm ranked the correct diagnosis first in 79% of the remaining 137 samples with known diagnoses across the remaining 15 modeled IEMs (Table 3), and 94% of samples had the correct diagnosis in their DD short list. Prediction performance (sensitivity, specificity, accuracy) of all individual disease-specific models measured by CTD + CTDdm ranks is shown in Table S3. All disease rankings for each individual can be viewed using the Metabolomics Data Portal (see "Methods"), in the Network-assisted Diagnostics tab ( Figure S2b). Table 2. Categorization of individuals based on classification of genetic variants identified in personal genome data. For each gene known to cause a given IEM, variants identified in a personal genome were assigned a pathogenicity category based on the application of the ACMG/AMP guidelines. Secondly, the observed zygosity (e.g., heterozygous, hemizygous or homozygous) of the variants identified in an individual's exome was considered alongside the expected Mendelian mode of inheritance for the disease gene (i.e., autosomal recessive).  Table 3. Accuracy of diagnostic rankings across 188 plasma samples with known disease. A differential diagnosis list (DD) is a ranked list of potential "candidate" diagnoses for each individual. Diagnoses were added to the DD if individual sample data meet a given threshold defined by each diagnostic method. Rankings were determined and compared for both a rule-based method described in Haijes et al. 20 and for our combined network (CTD + CTDdm) approach. IEMs inborn errors of metabolism, DD differential diagnosis list. www.nature.com/scientificreports/ To determine whether the increased sensitivity comes at the cost of lower specificity, we also examined the median size of the DD short lists for 172 reference samples. Given that reference samples represent individuals without disease, a maximally specific method would assign 0 DDs to each of these individuals. In Haijes et al. 20 , reference samples identified a median of 3 out of 58 DDs. In our network-based approach, we found the median size to be 1 out of 16 DDs, verifying that our approach shows comparable specificity.
Network approaches may be used for diagnosis in place of pathway-based approaches. While multi-metabolite perturbations are often interpreted in the context of well-curated pathway knowledge, such knowledge is not always available, does not include all measurable metabolites, and is not disease-specific. We therefore asked if data-driven network models may be used in lieu of pathway-based models. The possibility of using data-driven networks is particularly relevant for metabolic diseases whose affected pathways are not fully characterized, such as various peroxisomal, mitochondrial, and seizure disorders.
Toward this purpose, we defined a pathway-based diagnostic model to be one that considers only metabolites that are a priori known to be involved in a disease-relevant pathway, without information provided by the remaining untargeted metabolomics profile. In contrast, network-based diagnostic models and full-profile models were defined to be ones that considers all frequently detected metabolites in untargeted metabolomics profiles, without a priori information about which metabolites are most relevant for diagnosis. The former utilizes CTD-based metrics, whereas the latter includes all frequently detected metabolites as predictors.
To compare the approaches directly, we used untargeted metabolomics profiling data from individuals diagnosed with any of four genetically distinct urea cycle disorders 7 , where the disease mechanism is well-characterized by defects in enzymes and perturbations of metabolites in the urea cycle pathway (Fig. 3), as well as negative control ("reference") profiles. We compared the performance of partial least squares regression models: two modeled the relative abundances of metabolites in the urea cycle; third modeled CTD-and CTDdm-quantified scores; fourth modeled all frequently detected metabolites in the untargeted metabolomics profiles. We note that argininosuccinate is rarely identified by our platforms due to its low concentration in normal plasma samples. As a consequence of its rare presentation in normal "reference" blood samples, z-scores for argininosuccinate levels cannot be generated when identified in an individual sample and thus, cannot be used as a quantitative predictor. Instead, we encoded argininosuccinate as a binary predictor ("1" for the presence or "0" for the absence of the metabolite). We, therefore, defined the "ASA-Arg-Orn-Cit" model to include quantitative variables, arginine, ornithine, and citrulline, and one binary variable, argininosuccinate. To further evaluate diagnostic accuracy, we defined the full "Pathway" model to include all metabolites found in both the urea cycle and the periphery of the urea cycle, as illustrated in Fig. 3. www.nature.com/scientificreports/ As shown in Fig. 3, both network-and pathway-based models improve diagnostic power relative to the fullprofile model. CTD performed competitively with pathway-based models of four major urea cycle disorders: argininosuccinic aciduria (MIM:207900), argininemia (MIM:207800), ornithine transcarbamylase (OTC) deficiency (MIM:311250), and citrullinemia (MIM:215700). In the case of argininosuccinic aciduria and particularly for OTC deficiency, the network-based model outperformed the ASA-Arg-Cit-Orn model and was competitive with the full Pathway model. Interestingly, model accuracies for OTC deficiency suffered from poorer discrimination compared to argininosuccinic aciduria, argininemia, and citrullinemia. This is partially due to the known phenotypic heterogeneity of effects associated with X-inactivation patterns observed in females with OTC deficiency 31,32 .
Overall, this result suggests network-based and pathway-based modeling approaches have comparable accuracies. Thus, when biochemical pathway knowledge for a particular disease state is not available, data-driven network-based models may provide a valuable alternative.
Separating treatment-related effects from disease-related effects in metabolomics data. While the heterogeneity of effects associated with X-inactivation in OTC deficiency can explain why diagnostic accuracy of OTC deficiency was lower than many other IEMs, it is also likely that treatment-related effects may be confounding the raw metabolomics data and as a result, be affecting the ability of both pathway (knowledge-driven) and network-based (data-driven) models from performing well in diagnosis. To test for treatment confounding and whether it could be removed or ameliorated, we examined all OTC deficiency samples from Burrage et al. 7 where treatment information is available. We found that 8/10 patients with OTC deficiency were taking citrulline supplements as part of a prescribed treatment regimen. To identify treatmentdriven signatures that differentiated from disease-driven signatures, we constructed a citrulline supplement-specific network by contrasting eight OTC deficiency patients taking supplemental citrulline against the remaining 20 urea cycle disorder patients. A second network was learned from OTC deficiency patients not undergoing citrulline supplementation and edges found in both networks were pruned from the first network. Similar to "main disease module" identification (Algorithm 1), we determined the representative "main treatment module" for citrulline supplementation. When we compared the main disease module identified in the OTC disease-specific network to the main treatment module identified in the citrulline supplement-specific network, we found two out of four treatment-related compounds (e.g., carnitine and pyruvate) in the original OTC deficiency main disease module (Fig. 4A).
We then asked if treatment-related signatures contributed to false positives in patients with these diseases. Notably, the citrulline supplement-related treatment module was also found to overlap perturbations in other disease states such as cobalamin biosynthesis defect (carnitine, betaine) and argininosuccinate lyase deficiency (phenylalanine and pyruvate). As a consequence, we expected diagnostic accuracy to improve by omitting these treatment-related metabolites from the disease-specific OTC network folds. One case, for instance, is argininosuccinate lyase deficiency patient "EDTA-ASLD-7", where OTC deficiency was falsely ranked first by CTD + CTDdm (combined, p = 0.00568), while the correct diagnosis fell out of the top three rankings (combined, p = 0.0385). With the updated network model, significance of OTC deficiency dropped to the 5th in the list of diagnoses (combined, p = 0.0716), and the correct diagnosis ascended to within top three. Additionally, removal of treatment signal also unflagged a cobalamin biosynthesis defect patient "EDTA-COB-6", as the significance of OTC deficiency dropped an order of magnitude (combined, p = 0.00335). Removal of treatment-related compounds also increased model AUC, specificity, and overall accuracy (Fig. 4B).
This result suggests that pruning treatment-related nodes can improve the diagnosis of patients with other diseases where compounds to be removed are affected by the disease state. Nevertheless, the pruning of treatmentrelated nodes should be performed with caution, as treatment-related signatures that are particularly specific to a www.nature.com/scientificreports/ given diagnostic category can also improve diagnostic performance. Ideally, we would only collect metabolomic data prior to administering treatment, in order to train disease-specific network models for diagnostic purposes. However, many individuals that undergo untargeted metabolomic screening are already on a treatment regimen in order to manage their symptoms, and as a result, the confounding due to treatment is common. While we have shown that censoring treatment-related metabolites from our diagnostic OTC network models helped diagnostic accuracy, this represents only a first pass attempt to separate treatment effects from disease effects. Future research may be needed to explore new strategies that would improve diagnostic accuracy by removing treatment effects.
Dissecting the genetic etiology of peroxisome biogenesis disorders using untargeted metabolomics with network models. Peroxisome biogenesis disorders (PBDs) are autosomal recessive disorders that result from the impaired assembly and biological functioning of peroxisomes 33 . They are composed of two known major classes: Zellweger spectrum disorders (ZSD, MIM:214100,601539) and rhizomelic chondrodysplasia punctata (RCDP, MIM:215100). As a whole, PBDs are complex since they are caused by deficiency of any single gene in a group of related genes and their severity is strongly influenced by the residual activity of downstream gene products. Furthermore, PBDs affect the functions of numerous organ systems, and their pathological mechanisms of disease are only partially understood. We therefore asked whether a data-driven network-based approach may help dissect their etiology and help provide diagnostic information. Untargeted metabolomic profiling from Wangler et al. 6 describe 18 individuals with ZSD originating from deleterious variants in PEX1 (MIM:602136). We collected an additional 21 samples from individuals diagnosed with a RCDP type 1 disorder with confirmed deleterious variants in PEX7 (MIM:601757). For all 39 samples, pairwise patient-patient distance calculations using CTDncd were estimated and plotted in lower dimensional space (Fig. 5). As shown, the distances accurately cluster individuals with ZSD separately from individuals with RCDP and from known reference samples. Interestingly, examination of the ZSD cluster highlighted less pronounced abnormalities in plasma metabolite levels in older individuals ( ≥ 10 years) with a smaller centroid-centroid distance to the reference cluster (d = 0.625) than that of younger patients (< 10 years of age) (d = 0.949), in agreement with Wangler et al. 6 . Overall, k-means clustering generated a cluster purity score of 0.888, which indicates low false positive and false negative clustering results. These results suggest that CTDncd can accurately distinguish between groups of individuals with ZSD and RCDP, the two major subtypes of PBDs.
Evidence from the metabolome can help interpret variants of uncertain significance. Methods for using untargeted metabolomics data to resolve the pathogenicity of VUSs are currently manual, qualitative, and can be very laborious. We therefore asked if disease-specific networks and CTD-based metrics may help automate the interpretation of untargeted metabolomics profiling data to improve variant assessment. We recently reported that manual evaluation of untargeted metabolomics data aided in the diagnosis in the context www.nature.com/scientificreports/ of exome sequencing for several IEMs 22 . In our current analysis, we interpreted the same exome sequencing and respective metabolomics data in two stages. First, as in Alaimo et al. 22 , genetic variants identified in patient exome data were classified according to the ACMG/AMP guidelines. Individuals with variants found in at least one disease gene relevant to any of the 15 IEMs (X-linked OTC excluded) modeled in our analysis were categorized into the one of 5 groups (Table 2) based on the observed variants' pathogenicity and zygosity. Second, select patient metabolomics data were then analyzed using CTD and CTDdm in the context of relevant IEM-specific networks. For all individuals with variants categorized into groups 1-3 (Table 2) in disease-relevant genes for a given disease state (Table 1), individuals with a combined CTD + CTDdm disease-specific network p value < 0.05 were reported as a HIT, and the pathogenicity of the respective variants was re-interpreted towards a more pathogenic classification. Furthermore, individuals with combined network p values between 0.05 and 0.15 were reported as a BORDERLINE HIT. Of all 29 variant confirmations or re-interpretations reported by Alaimo et al. 22 , 10 had variants in disease genes relevant to the 15 IEMs (X-linked OTC excluded) modeled in this paper. Of all 10 individuals, 9/10 were classified as HIT and 1/10 was a BORDERLINE HIT (Table 4). These results suggest that the interpretation of untargeted metabolomic profiling data to improve variant assessment can be automated using CTD-based metrics.
Of further relevance are the patterns detected by CTD + CTDdm that connect individuals to disease states and that escaped manual inspection in Alaimo et al. 22 (Table 4). Patient 10 was a 21-year-old female who had a likely pathogenic homozygous variant (NM_004813.4:c.993_995del) in PEX16 (MIM:603360). Her metabolomic profile showed several metabolite perturbations consistent with a PBD. Out of 16 diagnoses, ZSD ranked 6th and RCDP ranked 2nd. CTD detected a module containing a very long-chain fatty acid that was positively perturbed (Table 4), a hallmark of ZSD, in the ZSD disease network. Review of clinical reports revealed that Patient 10 had an intellectual disability, spasticity, ataxia and structural brain abnormalities, phenotypes consistent with those observed in individuals with PEX16 pathogenic variants, as reported previously 34 . We then confirmed that this individual was the same individual who was diagnosed with a PBD in a recent publication 35 , a diagnosis that took over 18 years to establish. We previously reported that plasma disease signatures in individuals with a mild to intermediate ZSD are more pronounced in younger subjects, suggesting studies earlier in life reveal larger biochemical changes for a number of possible reasons 6 . CTD's ability to detect ZSD-relevant disease signatures in Patient 10 (21-year-old), however, shows how CTD can assist clinicians in difficult diagnostic situations.
While the CTDdm score put Patient 10's module moderately far away from the main disease module (63rd percentile), the ZSD disease network modeled perturbation patterns in individuals with PEX1 defects from a variety of different levels of severity. It is therefore plausible that a mild PEX16 disease signature, while highly connected in the PEX1 network, was more distal to several metabolites perturbed in individuals with PEX1associated ZSD. To make the ZSD disease network more sensitive to disease signatures observed in each of the 14 PEX genes that cause ZSD, using profiling data from individuals with biallelic defects in any of several different PEX genes would be beneficial for network learning. While this diagnosis was further complicated by reduced plasma-derived perturbations with age observed in ZSDs, the fact that the diagnosis was missed by previous manual inspection of metabolomic data 22 , however, highlights the power of our data-driven network method to identify both the modeled (PEX1) and related disease gene's (PEX16) effect on the metabolome.

Discussion
Several recent publications have called for systems biology solutions to shortcomings observed in current diagnostic methods for metabolic disorders [36][37][38] . IEMs provide a useful context for testing novel computational approaches such as CTD because the genetic etiology of many IEMs is well-established. In this work, we have shown the accuracy of the CTD method using untargeted metabolomics on a variety of IEM data sets. Our results pave the way toward the dissection of the genetic etiology and precise diagnosis of more complex, metabolically heterogeneous diseases, such as diabetes and metabolic syndrome.
CTD characterizes individuals' metabolomic likeness to a given disease state based on the connectedness of metabolite perturbations in a disease-specific network. Analogous to the background knowledge clinicians accumulate about metabolism by combining textbook knowledge of biochemistry and experience in the clinic, disease-specific networks are learned directly from representative profiling data and reflect information found in well-curated pathway knowledge 39 . Network-based similarity metrics have been constructed previously for use in disease-disease similarity 40 , functional protein similarity 41 , and similarity in clinical ontological terms 42 . We apply similar logic to quantify patient-patient (CTDncd) or patient-disease (CTDdm) distances using metabolomics data and use both metrics as additional network-quantified predictors of diagnosis.
CTD makes few assumptions about the nature of disease-associated perturbations and is thus well-suited for discovery and diagnosis of hard-to-diagnose cases. For example, CTD does not make any hard assumptions about the metabolites involved in a particular disease or the directionality (+ or −) of perturbations. While information about the directionality of a metabolite perturbation can be useful for diagnostic discrimination, there are several situations where this information can be disadvantageous. For example, related disease states and disease genes frequently affect the same molecular components but in different ways 40,43 . Just as different variants identified in a single gene can lead to different levels of disease severity, modeling the effect that one disease gene has on a metabolic phenotype may not accurately predict the metabolic effect of another gene that is involved in the same pathway, particularly when it comes to the direction that specific metabolites are perturbed. By considering combinatorial patterns without regard to directionality, CTD-based metrics make fewer assumptions and are therefore more suitable for data-driven discovery and for resolving hard-to-diagnose cases. If routine discrimination and not discovery is the main goal, information about the directionality of metabolite perturbations can be combined with combinatorial information garnered in CTD-based metrics (see Supplemental Text 1). www.nature.com/scientificreports/ As is the case with any data-driven modeling approach, the performance of CTD-based metrics is mostly determined by the amount and quality of data available. Based on our experience, we recommend a minimum of 5 disease profiles, a minimum of 25 reference profiles for each disease condition, and the use of surrogate disease and negative control profiles 21 to learn stable disease-specific networks ( Figure S5). If the samples represent a spectrum of disease severity or are the result of defects from more than one gene, we recommend gathering even more unique examples of disease prior to network learning.
The paucity of profiles available for rare disorders similarly affects both our network-based and rule-based approaches. There are many metabolic disease states including several IEMs where the paucity of cases has precluded data-driven classification. When a new case of such diseases shows a different metabolic perturbation signature that is contrary to existing biomarker-based knowledge, the case is very likely to be falsely omitted from the correct diagnosis by rule-based algorithms 20 . This problem will only be alleviated with the accumulation of more case profiling data. One strength of the network-based strategy is that new data can be incorporated automatically by a structured network learning process. In contrast, for rule-based approaches, knowledge curation and optimization can often be laborious.
Network models may also be significantly affected by confounding factors. For example, networks learned from samples collected after treatment was initiated will inevitably connect both treatment-related and diseaserelated metabolites together. Treatment-related modules, if not pruned properly in the network pruning stage, can cause some individuals without the modeled disease to be falsely diagnosed. CTDdm is designed to identify false positive calls made by CTD alone (see Supplemental Text 1) but fails when the treatment-related module is also well-connected and/or proximal with the disease-related module in the network. Such a circumstance may  Table 4. Variant re-interpretations based on evidence quantified from the metabolome. 10/10 variant interpretations discovered by manually inspecting metabolomics data from 170 individuals in Alaimo et al. 22 were reproduced using our automated pipeline, where 9/10 of those had strong significance and 1/10 had borderline significance. One novel finding is also reported, where one individual was diagnosed with a PBD, highlighting the ability of CTD-based metrics to detect disease-relevant signatures that are too complex or subtle to detect using manual inspection. www.nature.com/scientificreports/ arise when nearly all the samples used in network learning for a particular disease were on treatment at the time their blood was sampled. Similarly, some treatments (e.g., metabolite supplementation, enzyme replacement therapy) have shown to successfully normalize disease-relevant metabolite perturbations in targeted, disease pathways. Network models trained on metabolomic profiles without disease-relevant perturbations will not be informative. On the other hand, some treatment-related signatures-especially when the treatment is highly specific to a particular IEM-can improve diagnostic accuracy in some circumstances, as these signatures can provide indirect evidence for the presence of disease. In summary, our work suggests that data-derived network models offer competitive diagnostic accuracy compared to rule-based biomarker modeling approaches, show improved performance as more data accumulates, and can replace pathway-based modeling approaches, particularly when the relevant pathway knowledge is unavailable.

Conclusions
The benefit of CTD-based metrics can be particularly powerful when applied to individuals who are undiagnosed by current methods. By quantifying the likeness of individuals' metabolite perturbations with perturbation patterns observed in many diseases, candidate diagnoses can be ranked and possible diagnoses can be recommended. Furthermore, if genetic sequencing data are available for an individual exhibiting strong disease-specific metabolite perturbation patterns, VUSs can be re-interpreted given the functional evidence provided by untargeted metabolomics. While CTD-based metrics cannot eliminate manual review entirely, they can expedite it and increase the confidence by which clinical laboratory directors make diagnostic decisions. Finally, diseasespecific network models can be automatically and continuously updated as new case profiling data accumulates, ensuring stronger network stability and improved diagnostic performance.

Data availability
Datasets related to this article are available within Supplemental Table 2 and accessible via the Metabolomics Data Portal Download Data tab ( Figure S4).